Description

Background & Context

Thera bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service leads to a loss for the bank, so the bank wants to analyze its customer data, identify the customers who will leave the credit card service, and understand the reasons why, so that it can improve in those areas.

As a data scientist at Thera bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

You need to identify the best possible model that will give the required performance

Objective

  1. Explore and visualize the dataset.
  2. Build a classification model to predict whether a customer is going to churn.
  3. Optimize the model using appropriate techniques.
  4. Generate a set of insights and recommendations that will help the bank.

Data Dictionary:

Loading dataframe from csv file

Checking shape and random sample data from Dataframe

Checking data types of the columns in the dataframe
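The three steps above can be sketched as follows; a small inline CSV (with hypothetical columns) stands in for the actual data file, whose name is not given here, so the snippet is self-contained:

```python
import io
import pandas as pd

# In the notebook this would be pd.read_csv("<path to the dataset>");
# here a tiny inline CSV stands in for the real file.
csv_data = io.StringIO(
    "CLIENTNUM,Customer_Age,Gender,Attrition_Flag\n"
    "768805383,45,M,Existing Customer\n"
    "818770008,49,F,Attrited Customer\n"
)
df = pd.read_csv(csv_data)

print(df.shape)      # (2, 4) -> (rows, columns)
print(df.sample(2))  # random sample of rows
print(df.dtypes)     # data type of each column
```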

Removing the 'CLIENTNUM' column because it is a unique customer identifier and has no impact on any other column

No duplicate rows were found in the dataframe

Checking the number of missing values in each column
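These cleaning checks can be sketched as follows, using a small hypothetical frame in place of the full dataset:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the full dataset.
df = pd.DataFrame({
    "CLIENTNUM": [768805383, 818770008, 713982108],
    "Customer_Age": [45, 49, 51],
    "Education_Level": ["Graduate", np.nan, "High School"],
})

# CLIENTNUM is a unique identifier with no predictive value, so drop it.
df = df.drop(columns=["CLIENTNUM"])

print(df.duplicated().sum())  # number of duplicate rows
print(df.isnull().sum())      # missing values per column
```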

Columns 'Education_Level' and 'Marital_Status' have null values; we will fill them using imputation

Summary of the dataset.

Column 'Income_Category' has the value 'abc', which is totally different from the other values; we will treat it as missing and fill it using imputation
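Marking the stray value as missing can be done with a simple `replace`; the column values below are hypothetical stand-ins for the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical values illustrating the stray 'abc' entry.
df = pd.DataFrame({
    "Income_Category": ["Less than $40K", "abc", "$60K - $80K", "abc"],
})

# 'abc' carries no meaning, so convert it to NaN for later imputation.
df["Income_Category"] = df["Income_Category"].replace("abc", np.nan)
print(df["Income_Category"].isnull().sum())  # 2
```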

Data visualization

Univariate Plot

Checking for Age

16% of customers are 'Attrited Customer'.

Customer_Age data is well distributed.

53% of customers are female.

Most customers have 2 or 3 dependents, followed by those with 1 or 4 dependents.

31% of customers are 'Graduate', followed by 'High School' and 'Uneducated'.

Most customers are 'Married', followed by 'Single'; few are 'Divorced'.

Most customers have an income of less than $40K.

Most customers have the 'Blue' card.

Months_on_book data is well distributed; 50% of customers fall in the 31 to 40 range.

Most customers have a relationship count of 3.

75% of customers were inactive for 3 months or less.

75% of customers were contacted 3 times or fewer in the last 12 months.

The distribution of Credit_Limit is heavily right-skewed, with a median of 4,549, and has many outliers.

The distribution of Total_Revolving_Bal is fairly balanced, but most customers have a Total_Revolving_Bal near 0.

The distribution of Avg_Open_To_Buy is heavily right-skewed, with a median of 3,474, and has many outliers.

The distribution of Total_Trans_Amt is heavily right-skewed, with a median of 3,899, and has many outliers.

The distribution of Total_Trans_Ct is right-skewed, with a median of 67.

The distribution of Total_Ct_Chng_Q4_Q1 is heavily right-skewed, with a median of 0.702, and has many outliers.

Bivariate Analysis

Heat map

Key observations of numerical variables

Summary of EDA

Data Description:

Observations from EDA:

Data Pre-Processing

Feature Engineering

Missing-Value Treatment

The values obtained might not always be integers, which makes this an unsuitable way to impute categorical values

Define dependent and independent variables
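This split can be sketched as follows; the target encoding (1 = attrited, 0 = existing) is an assumption for illustration:

```python
import pandas as pd

# Tiny hypothetical frame; Attrition_Flag is the target column.
df = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Attrited Customer"],
    "Customer_Age": [45, 49],
})

X = df.drop(columns=["Attrition_Flag"])            # independent variables
y = df["Attrition_Flag"].map({"Existing Customer": 0,
                              "Attrited Customer": 1})  # dependent variable
print(X.shape, y.tolist())
```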

Imputing Missing Values
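A minimal sketch of imputation for the categorical columns, assuming the most-frequent strategy (which, unlike mean or median, always yields a valid category); the data below is hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical values with missing entries.
df = pd.DataFrame({
    "Education_Level": ["Graduate", np.nan, "Graduate", "High School"],
    "Marital_Status": ["Married", "Single", np.nan, "Married"],
})

# Fill each column's missing values with its most frequent value (the mode).
imputer = SimpleImputer(strategy="most_frequent")
df[df.columns] = imputer.fit_transform(df)
print(df["Education_Level"].tolist())  # ['Graduate', 'Graduate', 'Graduate', 'High School']
```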

Creating Dummy Variables
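One-hot encoding the categorical columns can be sketched with `pd.get_dummies`; the columns below are a hypothetical subset:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "F", "F"],
    "Card_Category": ["Blue", "Silver", "Blue"],
})

# drop_first avoids the dummy-variable trap (perfect multicollinearity
# between the dummies of one categorical column).
dummies = pd.get_dummies(df, drop_first=True)
print(list(dummies.columns))  # ['Gender_M', 'Card_Category_Silver']
```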

Building the model

Model evaluation criterion:

The model can make wrong predictions in two ways:

  1. Predicting a customer will churn when the customer does not churn - loss of resources
  2. Predicting a customer will not churn when the customer does churn - loss of the customer

Which case is more important?

How do we reduce this loss, i.e., reduce the number of False Negatives?
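False Negatives are churners the model misses, so recall is the metric to track; a toy illustration with assumed labels (1 = attrited, 0 = existing):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1]

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fn)                            # 1 -> one churner missed
print(recall_score(y_true, y_pred))  # 0.75 -> tp / (tp + fn) = 3 / 4
```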

Oversampling train data using SMOTE

Undersampling train data using Random Under Sampler

Hyperparameter Tuning

XGBoost trained on the oversampled data has the highest cross-validated recall, followed by Random Forest and then Gradient Boosting, both also on the oversampled data. We will tune the Random Forest, Gradient Boosting, and XGBoost models using RandomizedSearchCV and compare the performance of these three models.

First, let's create two functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
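One way such helpers could look (a sketch; the notebook's actual versions may differ in the metrics reported and in plotting the matrix):

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def model_performance(model, X, y):
    """Return accuracy, recall, precision, and F1 for a fitted model."""
    pred = model.predict(X)
    return pd.DataFrame({
        "Accuracy": [accuracy_score(y, pred)],
        "Recall": [recall_score(y, pred)],
        "Precision": [precision_score(y, pred)],
        "F1": [f1_score(y, pred)],
    })

def make_confusion_matrix(model, X, y):
    """Return the confusion matrix of a fitted model's predictions."""
    return confusion_matrix(y, model.predict(X))
```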

Random Forest Classifier

Now, let's see if we can get a better model by tuning the random forest classifier. Some of the important hyperparameters available for random forest classifier are:
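A hyperparameter search along these lines can be sketched as follows; the grid values are illustrative, not the notebook's actual search space, and synthetic data stands in for the train split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=1)

# A few commonly tuned random forest hyperparameters (illustrative values).
param_dist = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 4, 6, 8],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["sqrt", "log2"],
}

# RandomizedSearchCV samples n_iter combinations instead of trying them all,
# scoring each by cross-validated recall to focus on False Negatives.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions=param_dist,
    n_iter=5, scoring="recall", cv=3, random_state=1,
)
search.fit(X, y)
print(search.best_params_)
```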

Gradient Boosting Classifier

XGBoost Classifier

Comparing all tuned models

Performance on the test set

Pipelines for productionizing the model

Column Transformer
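A sketch of how the preprocessing and the model can be bundled into one deployable object; the column lists, imputation strategies, and final estimator are assumptions to be adjusted to the actual notebook:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column split; substitute the actual feature lists.
num_cols = ["Customer_Age", "Credit_Limit"]
cat_cols = ["Education_Level", "Income_Category"]

# Route numeric and categorical columns through separate preprocessing.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", RandomForestClassifier(random_state=1)),
])

# Tiny hypothetical frame to exercise the pipeline end to end.
df = pd.DataFrame({
    "Customer_Age": [45, 49, 51, 38],
    "Credit_Limit": [12691.0, np.nan, 3418.0, 8256.0],
    "Education_Level": ["Graduate", np.nan, "High School", "Graduate"],
    "Income_Category": ["$60K - $80K", "Less than $40K", np.nan, "$80K - $120K"],
})
y = [0, 1, 0, 1]
model.fit(df, y)
print(model.predict(df))
```

Because imputation and encoding live inside the pipeline, the same object can be pickled and applied to raw production data without repeating the preprocessing by hand.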

Business Insights and Recommendations